
Poster: aibek Date: Jan 25, 2014 2:59am
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

> And does it mean I will have to download every instance of a page and hash it myself to remove duplicates?

As you may have guessed, downloading every instance of a page and hashing it yourself would actually be worse than relying on the CDX digest. The copies served by the Wayback Machine are guaranteed to differ, because the Wayback Machine rewrites every link in the page into an internal web.archive.org URL. Those rewritten URLs contain the capture timestamp, and the timestamps obviously differ from one capture to the next.

You could, however, try identifying all of these internal links and removing them before computing the hash. Simply deleting the http://web.archive.org/web/TIMESTAMP/ part of every URL might be enough; see the sketch below.
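Here is a minimal, untested Python sketch of that idea. It assumes the rewritten links take the form /web/TIMESTAMP/... (optionally prefixed with http://web.archive.org, and optionally with a modifier such as im_ after the 14-digit timestamp) and strips those prefixes before hashing, so two captures that differ only in the Wayback rewriting should produce the same digest. The example URLs are hypothetical.

import hashlib
import re
import urllib.request

# Wayback rewrite prefixes: optional host, /web/, 14-digit timestamp,
# optional modifier like "im_" or "js_", then a trailing slash.
WAYBACK_PREFIX = re.compile(
    r"(?:https?://web\.archive\.org)?/web/\d{14}(?:[a-z]{2}_)?/"
)

def normalized_digest(capture_url):
    """Download one capture, strip the rewrite prefixes, hash what remains."""
    with urllib.request.urlopen(capture_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    stripped = WAYBACK_PREFIX.sub("", html)
    return hashlib.sha1(stripped.encode("utf-8")).hexdigest()

# Hypothetical captures of the same page; equal digests suggest duplicates.
captures = [
    "http://web.archive.org/web/20130101000000/http://example.com/",
    "http://web.archive.org/web/20130601000000/http://example.com/",
]
print([normalized_digest(u) for u in captures])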

This post was modified by aibek on 2014-01-25 10:59:00